Abstract



A new tightly coupled speech and natural language integration model is presented for a TDNN-based large-vocabulary continuous speech recognition system. The popular n-best techniques, developed mainly for integrating HMM-based speech and natural language systems, operate at the word level, which is inadequate for morphologically complex agglutinative languages. Our model instead constructs a spoken language system based on phoneme-level integration. The TDNN-CYK spoken language architecture is designed and implemented using a TDNN-based diphone recognition module integrated with table-driven phonological/morphological co-analysis. The model provides a seamless integration of speech and natural language for connectionist speech recognition systems, especially for morphologically complex languages such as Korean. Our experimental results show that speaker-dependent continuous Eojeol (word) recognition can be integrated with morphological analysis, achieving a morphological analysis success rate of over 80% directly from speech input for middle-level vocabularies.



1 Introduction 

A spoken natural language system requires many different levels of knowledge sources, including the acoustic-phonetic, phonological, morphological, syntactic, and semantic levels. These knowledge sources are grouped and processed in either speech processing models or statistical/symbolic natural language processing models. Because the speech and natural language communities have conducted almost independent research, these models have not been completely integrated and are often biased, neglecting either the acoustic-phonetic or the high-level linguistic information. A spoken language system requires seamless integration of speech signals into the high-level language processing components. Recent advances in large-vocabulary continuous speech recognition make an integrated speech and natural language system feasible. In a spoken language architecture, we must consider all the acoustic-phonetic and linguistic information equally and choose the most feasible candidates at each acoustic and language processing step.

Figure 1: N-best search: current speech and natural language integration method

Current speech and natural language inte- 
gration mainly relies on the word-level n-best 
search techniques [1] which are obviously in- 
efficient for morphologically complex agglu- 
tinative languages such as Korean. Figure 1 
shows the current n-best integration method. 

For HMM-based speech recognition systems, the n-best search techniques [1, 2] have been successfully applied to the integration of speech recognition systems into natural language systems. However, current implementations of the n-best techniques only support integration at the word level (using word sequences or a word lattice) and are mainly used to integrate existing speech and natural language systems [3, 4]. The n-best search is also viable only for short sentences, since n grows exponentially with the sentence length (the number of words in the sentence). Because the n-best search integrates at the word level, the natural language systems usually rely on a word-level dictionary, which is a reasonable assumption for morphologically simple languages such as English. However, most natural language systems that deal with morphologically complex languages use a morpheme-level dictionary for linguistic generality. For these languages, the dictionary of a large-vocabulary continuous spoken language system grows very quickly if we adhere to a full word-level phonetic dictionary, because new words can be generated almost freely by concatenating the constituent morphemes (e.g. noun + postposition or verb + verb-endings in Korean). To incorporate a general morpheme-level dictionary into the spoken language system, we must develop a sub-word-level integration technique between speech and natural language. Such a technique is especially important for languages such as Korean, whose very complex morphological structures arise from complex postpositions and verb-endings.
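
The dictionary-growth argument above can be made concrete with a small sketch. The morpheme inventories below are hypothetical toy data (they are not taken from the system's dictionary), and allomorphy of the case markers is deliberately ignored; the point is only that a word-level dictionary grows with the product of the inventories while a morpheme-level dictionary grows with their sum.

```python
from itertools import product

# Hypothetical toy inventories in Yale romanization (illustration only).
noun_stems = ["pha-il", "mwun-se", "phu-lo-ku-laym"]
noun_suffixes = ["", "tul"]              # "" means no plural suffix
case_markers = ["ul", "i", "ey", "lo"]   # simplified: allomorphy is ignored


def word_level_entries(stems, suffixes, markers):
    """Enumerate every full Eojeol form a word-level dictionary would need."""
    return ["+".join(part for part in parts if part)
            for parts in product(stems, suffixes, markers)]


def morpheme_level_size(stems, suffixes, markers):
    """Count morpheme-level entries (the empty suffix is not an entry)."""
    return len(stems) + len([s for s in suffixes if s]) + len(markers)


words = word_level_entries(noun_stems, noun_suffixes, case_markers)
n_morphemes = morpheme_level_size(noun_stems, noun_suffixes, case_markers)
print(len(words), "word-level entries vs.", n_morphemes, "morpheme-level entries")
```

Adding one more stem to this toy inventory adds eight word-level entries but only one morpheme-level entry.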

In this paper, we present a new integration architecture for speech and natural language based on table-driven phonological/morphological co-analysis, using the well-known dynamic programming technique [5] and a connectionist diphone spotting technique. Our model integrates phonological/morphological parsing into speech recognition not at the word level but at the phoneme level, for a more tightly coupled integrated system. The architecture is designed not for the popular HMM-based systems but for the more recently developed connectionist speech recognition systems. Connectionist speech recognition [6] has several advantages over classical symbolic and stochastic modeling. In particular, the time-delay neural network (TDNN) model [7] has been widely used to model the time-shift invariance of speech signals. However, integrated speech and natural language processing models using the TDNN have received little attention so far (footnote 1). We therefore present a phoneme-level integration method for a large-vocabulary connectionist speech recognition model using the TDNN, especially for morphologically complex agglutinative languages.



Footnote 1: One notable exception is the research by Sawai

2 Features of spoken Korean



Korean, which can be classified as a morphologically agglutinative and syntactically SOV language, has several unique linguistic features. The following morphological and phonological features of spoken Korean are needed to understand our integration method. For the syntactic-level features, [10] explains some Korean syntax modeling. In this paper, Yale romanization is used to represent the Korean phonemes.

1) A Korean word, called an Eojeol, consists of more than one morpheme with clear-cut boundaries in between. For example, the Eojeol pha-il+tul+ul (files [obj]) consists of 3 morphemes:

pha-il (file) + tul (plural suffix) + ul (object case-marker)

2) Korean is a postpositional language 
with noun-endings, verb-endings, and pre- 
final verb-endings. These functional mor- 
phemes determine the noun's case roles, 
verb's tenses, modals, and modification rela- 
tions between Eojeols. For example, in swu- 
ceng-ha+yess+ten pha-il (the file that was 
edited), the verb swu-ceng-ha (edit) is of past 
tense and modifies pha-il (file) according to 
the given verb-endings: 

swu-ceng-ha (edit) + yess (past tense pre- 
final verb-ending) + ten (adnominal verb- 
ending) 

3) The unit of pause in spoken Korean (called an Eonjeol) may differ from the unit of spacing in written Korean (the Eojeol). For example, in speaking nay-ka e-cey swu-ceng-ha-yess-ten pha-il-tul-ul /tmp lo pok-sa-ha-ye-la (spaces delimit Eojeols; the sentence means "copy the files that I edited yesterday to /tmp"), a person may pause after nay-ka, after e-cey swu-ceng-ha-yess-ten pha-il-tul-ul, and after /tmp lo pok-sa-ha-ye-la.

4) Phonological changes occur within a morpheme, between morphemes in an Eojeol, and between Eojeols in an Eonjeol. These changes include assimilation, dissimilation, contraction, and insertion. For example, the morpheme pok-sa ("copy") is pronounced as pok-ssa (dissimilation), and kwuk-min ("nationality") is pronounced as kwung-min (assimilation). The Eojeol swu-ceng-ha-yess-ten is pronounced as swu-ceng-ha-yet-tten.



3 System architecture for speech and natural language integration

Our integration technique employs a phoneme lattice and a morpheme-level phonetic dictionary. This is a finer-grained integration than the classical approaches that use a word lattice and a word-level dictionary, such as the n-best integration technique used mainly for English. The phoneme lattice makes phonological rule modeling possible at an early stage of spoken language processing. Phonological and morphological analysis can be performed together using the morpheme-level phonetic dictionary, and the dictionary size remains stable regardless of the vocabulary size because new words can be generated by combining existing morphemes in the dictionary. Unlike the conventional integration method, which uses separate dictionaries for speech recognition and natural language processing, our integration model uses a unified morpheme-level phonetic dictionary together with declarative morphotactic and phonotactic information. In our spoken language architecture, we employ a hierarchy of diphone-spotting TDNNs for the acoustic-level processing and develop a phonological/morphological co-analysis technique for the seamless integration. The output of the integrated architecture can be fed directly to conventional natural language syntax/semantics analysis systems.

Figure 2 shows the integrated spoken language processing architecture, a tightly coupled integration model of speech and natural language. The speech signal is analyzed by the TDNN diphone recognizer, which also rearranges the diphone strings to produce the phoneme lattice. From the phoneme lattice, the morphological analyzer produces the morphologically analyzed Eojeols by handling morphological segmentation, morphotactics verification, and irregular conjugation. The phonological processing is integrated into the morphological parsing through declarative phonological rule modeling. The next two sections explain the speech recognition and the morphological/phonological processing in detail.

Figure 2: TDNN-CYK integration architecture



4 Diphone-based speech recognition

For large-vocabulary continuous speech recognition, sub-word-level recognition must be supported. We selected a group of diphones as the sub-word unit because direct phoneme recognition in Korean is very difficult. The 46 Korean phonemes are very similar to each other, especially in the following cases: 1) the Korean diphthongs are hard to distinguish from the monophthongs, and 2) the syllable-final consonants are hard to differentiate from the syllable-first consonants. The selected diphone groups (figure 3) carry more information than the phonemes and are far fewer in number than the popular triphones [11].

diphone group    number of diphones
V                21
C1V              378
VC2              147
C2C1             126

Figure 3: Korean diphone groups (V: vowel, C1: syllable-first consonant, C2: syllable-final consonant)
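
The counts in figure 3 can be checked with a few lines of arithmetic. Only the vowel count (21) is stated in the figure; the consonant inventory sizes used below (18 syllable-first and 7 syllable-final consonants) are an assumption inferred from the factorization of the table entries, although they are consistent with the 46 phonemes mentioned above.

```python
# 21 vowels is given in figure 3. The two consonant counts are assumptions
# inferred from the table (378 = 18 * 21 and 147 = 21 * 7), not quoted values.
N_VOWELS = 21
N_INITIAL = 18   # C1: syllable-first consonants (assumed)
N_FINAL = 7      # C2: syllable-final consonants (assumed)

diphone_groups = {
    "V": N_VOWELS,
    "C1V": N_INITIAL * N_VOWELS,
    "VC2": N_VOWELS * N_FINAL,
    "C2C1": N_FINAL * N_INITIAL,
}
total_diphones = sum(diphone_groups.values())
n_phonemes = N_VOWELS + N_INITIAL + N_FINAL

for group, count in diphone_groups.items():
    print(f"{group:5s} {count}")
print("total diphones:", total_diphones, "| phonemes:", n_phonemes)
```

Under these assumed inventory sizes, every entry in the table is reproduced exactly.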

Figure 4 shows the diphone-based TDNN speech recognition system. The system consists of a total of 19 different TDNNs for recognizing the Korean diphone groups.

The speech recognition is performed through the following steps (for more details, see [12]):

1) Pre-processing: The digitized speech signal is segmented into 200 msec windows, transformed with a 512-point FFT, and reduced to 16 mel-scaled filter-bank coefficients. The short-time energy and the zero-crossing rate are used for end-point detection. Each frame is 10 msec long, and 20 frames of 16 normalized filter-bank coefficients are fed to the vowel group recognition TDNN.

2) Vowel group recognition: The input is 20 frame vectors (20*16 = 320 units) and the output is 18 units for the 18 vowel groups (17 groups according to the contained vowel and one CC group with no vowel). For each vowel group, a separate diphone recognition TDNN is invoked, which gives the system a hierarchical TDNN architecture. Each TDNN has the standard architecture described in [7].

3) Diphone recognition: According to the recognized vowel group, the pertinent diphone recognition TDNN is activated. For each TDNN, the input is the same 20 frame vectors, and the output is the classified diphones for that vowel group. For example, the /ya/ vowel group has a total of 15 output units: 9 for C1V-type diphones (/kya/, /nya/, /tya/, /lya/, /mya/, /pya/, /sya/, /kkya/, /hya/), 5 for VC2-type diphones (/yak/, /yal/, /yan/, /yam/, /yang/), and one for the V-type diphone (/ya/). Each of the 18 TDNNs has a different number of output units, according to the number of diphones in its vowel group.

Figure 4: Diphone-based TDNN speech recognition system

4) Diphone-to-phoneme decoding: From the resulting diphone sequences, this step obtains the phoneme lattice, which contains the candidate phoneme sequences. We use a simple deterministic decoding heuristic without any probabilistic calculations and try to keep all the possible diphone spotting results in the phoneme lattice, since the later phonological/morphological processing can safely prune the incorrect recognitions. The decoding begins by grouping the diphones by type (C1V, V, VC2, and C2C1). The frequency count of each diphone, that is, the number of times the diphone is spotted across successive 10 msec frame shifts, is used to fix insertion errors by deleting the diphones with lower frequency counts. Finally, the diphones are split into their constituent phonemes, and the phonemes shared by neighboring diphones are merged. This simple non-probabilistic decoding scheme works surprisingly well for our domain, and the resulting phoneme lattice reliably provides all the possible output phonemes of the speech recognition.
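
The decoding heuristic of step 4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the system's actual decoder: it assumes one spotted diphone label per frame shift, uses a hypothetical threshold `min_count` for the frequency count, and returns a single phoneme chain instead of a lattice with competing alternatives. The frame data are invented for the example.

```python
from itertools import groupby


def decode_diphones(frames, min_count=2):
    """Collapse per-frame diphone labels into one phoneme sequence.

    frames: list of phoneme tuples, one spotted diphone per frame shift.
    min_count: runs shorter than this are treated as insertion errors
               (hypothetical threshold, chosen for illustration).
    """
    # Frequency count: how many consecutive frame shifts spotted each diphone.
    runs = [(label, len(list(group))) for label, group in groupby(frames)]
    # Delete low-count diphones, then re-join runs that became adjacent.
    kept = [label for label, _ in
            groupby(label for label, count in runs if count >= min_count)]
    phonemes = []
    for diphone in kept:
        for index, phoneme in enumerate(diphone):
            # Merge the phoneme shared by two neighbouring diphones.
            if index == 0 and phonemes and phonemes[-1] == phoneme:
                continue
            phonemes.append(phoneme)
    return phonemes


# Invented frame labels for ci-wul-sswu with one spurious single-frame diphone.
frames = (
    [("c", "i")] * 4 + [("wu", "l")] * 3 + [("k", "a")] * 1
    + [("l", "ss")] * 2 + [("ss", "wu")] * 4
)
decoded = decode_diphones(frames)
print(decoded)
```

A lattice-producing version would keep the competing diphones at each position as alternatives rather than discarding all but one chain.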

5 Morphological analysis from 
the phoneme lattice 

The morphological analysis transforms the phoneme lattice into sequences of morphologically analyzed Eojeols (the Eojeol is the unit of spacing in Korean orthography and usually consists of a single noun or verb stem plus several functional morphemes). Our morphological analysis takes a phoneme lattice rather than a phoneme string as input, since we want to be able to exploit all the speech recognition results during the morphological analysis. The phoneme lattice provides alternative phonetic transcriptions of the speech sounds, which must be transformed into orthographic morpheme strings. Unlike conventional morphological analysis of written text, the morphological analysis of the phoneme lattice must solve the following sub-problems:

1) The phonetic transcriptions must be segmented and mapped onto the orthographic morphemes, which are the basic units of written language processing.

2) The phonological changes that can be captured by the Korean phonological rules must be modeled and processed during the morphological analysis.

3) An efficient dictionary search is required because the phoneme lattice yields an exponential number of phoneme chains.

The Korean morphological analyzer [13] was implemented based on the well-known CYK parsing technique [5] and augmented to handle the Korean phonological changes and the phoneme lattice input. Figure 5 shows our morphological analysis scheme for the phoneme lattice.

Figure 5: Morphological parsing of the phoneme lattice (from top: morphologically analyzed output Eojeol, CYK triangular table, input phoneme lattice). The example phoneme lattice was obtained from the input speech ci-wul-sswu (deletable) using the diphone-based TDNN speech recognition system, and the morphological analysis produces ci-wu+l+swu, where "+" is the morpheme boundary and "-" is the syllable boundary. The CYK triangular table is filled in with all the possible morphemes obtainable from the dictionary look-up, and also with all the possible morpheme combinations.

The basic process of the Korean morpho- 
logical analysis consists of the morpheme seg- 
mentation, checking the possible morpheme 
connectivity (handling of the morphotactics), 
and the reconstruction of the original mor- 
phemes from the irregular conjugations (han- 
dling of the orthographic rules). 

The morpheme segmentation is performed using the morpheme entries in the dictionary. During the left-to-right scan of the input, each morpheme found in the dictionary is enrolled in the CYK table at the proper position. For example, in figure 5, three different morphemes, namely the verb ci-wu, the adnominalizing verb-ending l, and the bound noun swu, are enrolled at the positions (0,2), (3,3), and (4,5) respectively. The position (i,j) designates the start and end positions in the input characters: the verb ci-wu starts at position 0 (the first position) and ends at position 2, and hence consists of 3 characters. We enroll all the morphemes matched on the input string in the CYK table (see figure 5 for other possible morphemes). During the segmentation, the possible morpheme connectivity must be checked to select the correct morpheme boundaries for the input string. The morpheme connectivity can be verified from the Korean morphotactic information, which is included in the dictionary using specialized Korean part-of-speech symbols (called connectivity information) [13]. We divided the 13 major Korean part-of-speech symbols into about 200 refined symbols (tags) for efficient verification of the connectivity of each morpheme, and constructed the morpheme connectivity matrix, which designates the possible relative placement of the 200 refined part-of-speech tags in the string. For example, in figure 5, the morpheme ci-wu (verb stem, meaning "delete") can appear to the left of the morpheme l (adnominalizing verb-ending) because the morpheme connectivity matrix verifies that connecting a verb stem to an adnominalizing verb-ending is legal. The CYK table provides the possible positions for the connectivity checking. For example, in figure 5, the connectivity of ci-wu and l is worth checking because the positions (0,2) and (3,3) can be concatenated to produce the position (0,3), so the result ci-wu+l is put in the position (0,3). Irregular conjugations are handled declaratively by putting the inflected forms as well as the original forms of the morphemes in the dictionary.
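
The table-filling procedure described above can be sketched as follows for the ci-wu+l+swu example. This is a minimal illustration under several assumptions: the input is a single phoneme string without phonological changes (c i wu l s wu) rather than a lattice, the mini-dictionary and tag names are hypothetical, and the connectivity matrix is reduced to a set of legal (left tag, right tag) pairs instead of the roughly 200 refined tags of the actual system.

```python
# Hypothetical mini-dictionary: phoneme sequence -> [(morpheme, tag)].
DICTIONARY = {
    ("c", "i", "wu"): [("ci-wu", "verb-stem")],
    ("l",): [("l", "adnominal-ending")],
    ("s", "wu"): [("swu", "bound-noun")],
    ("i",): [("i", "case-marker")],   # distractor with no legal neighbour here
}
# Hypothetical connectivity "matrix": the legal (left tag, right tag) pairs.
CONNECTIVITY = {
    ("verb-stem", "adnominal-ending"),
    ("adnominal-ending", "bound-noun"),
}


def enroll(table, span, entry):
    """Add an analysis to the CYK cell for span, skipping duplicates."""
    cell = table.setdefault(span, [])
    if entry not in cell:
        cell.append(entry)


def cyk_analyze(phonemes):
    """Fill a CYK table; entries are (morphemes, leftmost tag, rightmost tag)."""
    n = len(phonemes)
    table = {}
    # Dictionary look-up: enroll each matching morpheme at its span (i, j).
    for i in range(n):
        for j in range(i, n):
            for morph, tag in DICTIONARY.get(tuple(phonemes[i:j + 1]), []):
                enroll(table, (i, j), ((morph,), tag, tag))
    # Combine adjacent spans (i, k) and (k + 1, j) when their tags may connect.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for left, l_first, l_last in table.get((i, k), []):
                    for right, r_first, r_last in table.get((k + 1, j), []):
                        if (l_last, r_first) in CONNECTIVITY:
                            enroll(table, (i, j), (left + right, l_first, r_last))
    return table


phonemes = ["c", "i", "wu", "l", "s", "wu"]
table = cyk_analyze(phonemes)
analyses = ["+".join(m) for m, _, _ in table.get((0, len(phonemes) - 1), [])]
print(analyses)
```

In this sketch the cell (0,3) receives ci-wu+l exactly as described in the text, and the distractor at (1,1) is enrolled but never combined because no connectivity pair licenses it.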

Figure 6: The morpheme-level phonetic dictionary. The figure shows three different morpheme entries, ci-wu, l, and swu, with their phonetic transcription headers, original morphemes, left and right morphological connectivity, and left and right phonemic connectivity information. In the actual dictionary implementation, the morphological and phonemic connectivity information is encoded using specialized symbols. The left and right distinction is for morphemes whose connectivity information differs between the left-concatenated and the right-concatenated morphemes.

The basic morphological analysis scheme described above was augmented to solve the three sub-problems in handling the phoneme lattice input and the phonological changes during the morphological analysis.

1) To map the phonetic transcriptions onto the orthographic morphemes, we indexed each morpheme in the dictionary by its phonetic transcription header and so constructed a morpheme-level phonetic dictionary. A single phonetic transcription can be associated with many different morpheme entries in the case of homophonous morphemes. In this way, accessing a phonetic header leads to all the corresponding morphemes in their orthographic forms. Figure 6 shows the morpheme-level phonetic dictionary.

2) In Korean, phonological changes can occur within a morpheme or across a morpheme boundary. In the former case, the phonetic transcription headers in the dictionary already reflect the phonological changes, since each dictionary entry is a whole morpheme. In the latter case, however, we have to model the Korean phonological rules to handle the between-morpheme phonological changes. We declaratively modeled the major Korean phonological rules, including the 2nd consonant standardization, consonant assimilation, palatalization, glottalization (consonant dissimilation), and insertion, according to the Korean Ministry of Education Standard, and processed the Korean phonology during the morphotactic verification. Each declarative phonological rule is encoded in the left and right phonemic connectivity information in the dictionary. For example, the bound noun swu has the phonetic realization sswu after the l sound, and figure 6 records this phenomenon in the left phonemic connectivity information of the swu entry. A separate phoneme connectivity matrix records all the possible relative phoneme placements, much like the morpheme connectivity matrix. When the morphotactics is checked, the phoneme connectivity matrix is also checked to verify the possible phonological changes between the morphemes.

3) To handle the phoneme lattice search, 
we use the TRIE indexing for the fast dictio- 
nary access [14]. The breadth-first search on 
the TRIE structures for the phonetic tran- 
scription header can prune the unnecessary 
paths efficiently, and hence deal with the 
complexity of the phoneme lattice search. 
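
The combination of a phonetic-header index and a breadth-first search over the lattice might look like the sketch below. It is an illustration under assumptions, not the implementation of [14]: the lattice is simplified to one set of candidate phonemes per position, the trie is a plain nested dictionary, and the entries (including the sswu realization of swu) are hypothetical stand-ins for the dictionary of figure 6.

```python
from collections import deque


def build_trie(entries):
    """Index morphemes by phonetic header; each header is a phoneme tuple."""
    root = {"children": {}, "morphemes": []}
    for header, morpheme in entries:
        node = root
        for phoneme in header:
            node = node["children"].setdefault(
                phoneme, {"children": {}, "morphemes": []})
        node["morphemes"].append(morpheme)
    return root


def search_lattice(lattice, root):
    """Return sorted (start, end, morpheme) matches; end is inclusive."""
    matches = set()
    for start in range(len(lattice)):
        queue = deque([(root, start)])      # breadth-first over trie nodes
        while queue:
            node, pos = queue.popleft()
            if pos == len(lattice):
                continue
            for phoneme in lattice[pos]:
                child = node["children"].get(phoneme)
                if child is None:
                    continue                # prune: no header has this prefix
                for morpheme in child["morphemes"]:
                    matches.add((start, pos, morpheme))
                queue.append((child, pos + 1))
    return sorted(matches)


# Hypothetical entries: phonetic header -> orthographic morpheme.
entries = [
    (("c", "i", "wu"), "ci-wu"),
    (("l",), "l"),
    (("s", "wu"), "swu"),
    (("ss", "wu"), "swu"),    # realization of swu after the l sound
]
# Simplified lattice: a set of candidate phonemes at each position.
lattice = [{"c", "ch"}, {"i"}, {"wu", "u"}, {"l"}, {"ss", "s"}, {"wu"}]
matches = search_lattice(lattice, build_trie(entries))
print(matches)
```

The matches returned here correspond to the spans (0,2), (3,3), and (4,5) that would be enrolled in the CYK table; candidate phonemes such as ch and u are dropped as soon as no header in the trie starts with them.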



6 Implementation and experimental results

The TDNN-CYK spoken language system was implemented in C with a standard X Window user interface on UNIX/Sun SPARC platforms. The system's inputs are carefully articulated Korean speech recorded in a normal laboratory environment, and its outputs are morphologically analyzed Eojeol sequences that can be fed directly to a conventional natural language syntax analysis system [10]. We constructed a 1000-entry morpheme-level phonetic dictionary for the UNIX operating system domain, and morpheme connectivity and phoneme connectivity matrices with over 100 entries for the phonological/morphological analysis. The dictionary is indexed using a phoneme-based TRIE to handle the phoneme lattice search. Since no standard segmented Korean speech database is available yet, we constructed our own by recording and manually segmenting the 73 most frequent Korean diphones. The 73 diphones were acquired from 300 Korean Eojeols (each Eojeol pronounced 15 times by a female speaker) in 100 Korean sentences that can appear in natural language commands to the UNIX operating system [15].

Several experiments were performed to verify the system's performance in terms of time-shift invariance, diphone recognition, and final Eojeol recognition including the morphological analysis. Brief results of each performance test are given below. In each experiment, the input speech patterns were prepared as follows. Eojeols were recorded in a normal laboratory environment with an average S/N ratio of 12 dB. Speech data were sampled at 16 kHz and Hamming windowed. From the windowed data, 512-point DFTs were computed at 5 msec intervals. The DFTs were used to generate 16 mel-scale filter-bank coefficients at 10 msec intervals [7]. These spectra were normalized to produce suitable input levels for the four-layer TDNNs. We used the hyperbolic arc tangent error function for the weight updating [16] in the back-propagation training, and updated the weights after a small number of iterations [17].
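
As a rough illustration of how the prepared frames become TDNN inputs, the sketch below slides a 20-frame (200 msec) window over a sequence of 16-coefficient frames and flattens each window into 320 input values. The normalization shown (zero mean, peak magnitude 1) is an assumption, since the exact scheme is not specified here, and the frame values are synthetic.

```python
FRAME_MS = 10          # filter-bank frame interval
WINDOW_FRAMES = 20     # 20 frames = 200 msec TDNN input window
N_COEFFS = 16          # mel-scale filter-bank coefficients per frame


def normalize(frames):
    """Scale to zero mean and peak magnitude 1 (assumed scheme)."""
    values = [v for frame in frames for v in frame]
    mean = sum(values) / len(values)
    peak = max(abs(v - mean) for v in values) or 1.0
    return [[(v - mean) / peak for v in frame] for frame in frames]


def tdnn_windows(frames, shift_frames=1):
    """Yield flattened 20-frame windows (320 values) over a frame sequence."""
    for start in range(0, len(frames) - WINDOW_FRAMES + 1, shift_frames):
        window = frames[start:start + WINDOW_FRAMES]
        yield [v for frame in window for v in frame]


# Synthetic 250 msec utterance: 25 frames of 16 coefficients each.
frames = normalize([[float(t + c) for c in range(N_COEFFS)] for t in range(25)])
windows_10ms = list(tdnn_windows(frames, shift_frames=1))
windows_30ms = list(tdnn_windows(frames, shift_frames=3))
print(len(windows_10ms), len(windows_30ms), len(windows_10ms[0]))
```

A shift of three frames corresponds to the 30 msec shift used in the continuous diphone spotting experiment.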

6.1 Time-shift invariance of Korean diphones

We generated 2400 diphone samples for 12 typical Korean diphones. The input patterns for the two tests were kept the same in order to compare the no-shift and shift cases. Figure 7 shows that the Korean diphone recognition has the time-invariance property of the TDNN and suggests an optimal time interval of about 200 - 250 msec for the diphones. These results imply that context-independent diphone-based TDNN recognition is possible.

6.2 Comparison of diphone recognition vs. phoneme recognition

This experiment shows that the diphone can improve the recognition rate of the Korean vowels, despite the many rising diphthongs, compared with phoneme recognition. In the test, we set a 150 msec time range for the phoneme segmentation and 200 msec for the diphone segmentation. Figure 8 shows that, compared with the phoneme recognition case, the recognition rate of the diphones does not decrease much when the number of targets with similar features doubles. Moreover, a unit with more than one feature can be recognized efficiently at a high rate in the diphone recognition.

Figure 8: Diphone recognition versus phoneme recognition test

6.3 Performance of continuous diphone recognition

In this experiment, we extracted the 72 most typical Korean diphones from 66 Eojeols, each of which was pronounced 15 times, to generate about 5500 diphone patterns. The 5500 training samples were used to train the vowel group TDNN and 10 different sub-TDNNs for each diphone group. For the recognition test, 262 new Eojeols were selected to generate 2432 test Eojeol patterns, and the input window was shifted by 30 msec at a time to measure the TDNNs' diphone spotting performance in continuous speech. Figure 9-a shows the continuous diphone spotting performance. There are 7772 target diphones in total in the 2432 test Eojeol patterns. In the figure, "correct" means that the correct target diphone was spotted in the testing position, and "delete" designates the other case. "Insert" means that a non-target diphone was spotted in the testing position. To compare the ability to handle continuous speech, we also tested the diphone spotting on hand-segmented test patterns with the same 7772 target diphones. Figure 9-b shows the segmented diphone spotting performance. Since the test data are hand-segmented before input, there are no insertion and deletion errors in this case. The fact that the segmented speech performance is not much better than the continuous one (93.8% vs. 93.4%) demonstrates the diphone's suitability for handling continuous speech.

Figure 9: Continuous diphone spotting versus segmented diphone spotting (panel b: segmented diphones)

6.4 Performance of continuous Eojeol recognition

To test the full Eojeol recognition, including the phoneme decoding and the morphological analysis, a middle-vocabulary experiment was carried out. The task is speaker-dependent continuous Eojeol recognition that produces morphologically analyzed Eojeol sequences. In this process, the speech recognizer produces a phoneme lattice that should include the correct phoneme sequence of the input Eojeol, and the morphological analyzer then produces the analyzed Eojeol from the phoneme lattice. In this task, therefore, all the intermediate steps, that is, diphone spotting, phoneme lattice decoding, and morphological analysis from the phoneme lattice, combine to produce the final recognition performance. The same 262 Eojeols as in section 6.3 are fed to the fully integrated system with the pre-trained TDNNs. Figure 10 shows the final performance on the continuous Eojeols. There are 9605 target morphemes in total in the same 2432 test Eojeol patterns used in section 6.3. In the figure, "correct" means that the correct morpheme sequence can be analyzed from the speech input, and "delete" means that the correct morpheme sequence cannot be generated. "Insert" gives the percentage of spurious morphemes generated from insertion errors. The final morphological analysis success rate is above 80%, which is promising but still relatively low compared with the continuous diphone recognition. The relatively low performance is due to the large number of insertion errors over long stretches of continuous speech, which cannot be handled properly in the phoneme lattice decoding. However, the morphological analyzer performed perfectly whenever the phoneme lattice contained the correct phoneme sequences.



category         pattern size (rec. rate)
total            9605
correct          7696 (80.1%)
delete           1909 (19.8%)
insert           7182 (74.76%)

Figure 10: Continuous Eojeol recognition including morphological analysis
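
The rates in figure 10 can be recomputed from the raw counts as a sanity check. The assignment of 7696 to "correct" and 7182 to "insert" is an assumption based on the surrounding text, because those two row labels are missing from the figure as it survives here; all rates are taken relative to the 9605 target morphemes.

```python
TOTAL_MORPHEMES = 9605

# Row labels "correct" and "insert" are assumed from the text of section 6.4.
counts = {"correct": 7696, "delete": 1909, "insert": 7182}

rates = {name: 100.0 * n / TOTAL_MORPHEMES for name, n in counts.items()}

# Correct and deleted morphemes together account for every target morpheme.
accounted = counts["correct"] + counts["delete"]

for name, rate in rates.items():
    print(f"{name:8s} {counts[name]:5d}  {rate:.2f}%")
print("correct + delete =", accounted)
```

The recomputed values agree with the reported percentages to within rounding; the insert rate is measured against the number of targets, which is why it can be large even though correct and delete already sum to the total.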

7 Comparison with related research

Recently, the idea of sending only the best n speech recognition results to the natural language system has been implemented using the time-synchronous Viterbi-style beam search algorithm [1]. The algorithm was further improved by the word-dependent search [2] and by adding the A* backward tree search [18]. The n-best integration is mainly used for HMM-based continuous speech recognition systems, and many existing speech systems and natural language systems have been successfully integrated using the n-best word search techniques [3, 4]. Until now, however, the n-best search techniques have only been implemented to produce the n-best sentences directly from word sequences or a word lattice, and this word-level integration has been successful for morphologically simple languages such as English. In contrast, our integration works at the phoneme level using the phoneme lattice, because we need more sophisticated phonological/morphological handling in the integration process. The word-level n-best integration also assumes a word-level dictionary, which is an unreasonable assumption for morphologically complex languages.

The HMM-LR integration [19, 20] was implemented using the HMM's phoneme spotting ability integrated with the generalized LR parsing techniques [21]. Unlike the n-best integration, the HMM-LR integration is tighter and is implemented at the phoneme level by extending the LR parser's terminal symbols to cover the phonetic transcriptions. In this scheme, the LR parser selects the most probable parsing results by obtaining the probability of the end-point candidate phonemes from the HMM's forward probability calculation. The integrated system thus works by having the LR parser predict the next phoneme candidates, which are then verified by the HMM's phoneme spotting. The idea of extending the LR grammar to the phonetic transcriptions seems to work for phoneme-level integration. However, the scheme has no separate language-level dictionary, which results in degenerate phonological/morphological processing and makes the necessary scale-up difficult. In contrast, our TDNN-CYK integration focuses on the general phonological/morphological handling that is essential for agglutinative languages.

The idea of extending the LR grammar to the phonetic transcriptions was also applied in the TDNN-LR integration method [8, 9], which was implemented similarly by replacing the HMM's phoneme spotting with the TDNN's phoneme spotting. The integration was implemented by a dynamic time warping (DTW) level-building search [22] between the TDNN's phoneme sequences and the LR grammar's phoneme sequences. However, the performance was relatively poor compared with the HMM-LR integration method [9]. There are basically two reasons for this: 1) the TDNN model has rarely been applied to practical large-vocabulary systems so far and therefore lacks the fine tuning of the popular HMM models, and 2) the TDNN model has yet to find the right way to be integrated effectively into a natural language processing model. The HMM model supports natural integration into general chart-based parsing models such as generalized LR parsing because there are well-defined probabilistic search techniques to integrate with. However, the output activations of multiple TDNNs are difficult to normalize and therefore difficult to integrate naturally into popular probabilistic search schemes such as the Viterbi search. Our TDNN-CYK method does not employ any probabilistic search in its integration, but sends the entire phoneme lattice to the morphological analyzer. In this way, we can exploit all the TDNN outputs at the language processing level, which is somewhat inefficient but safe for the current scheme.

8 Conclusion 

This paper presents a phoneme level in- 
tegration of speech and natural language 
in a connectionist speech recognition model 
for agglutinative languages such as Ko- 
rean. Our model's main contribution is 
to define the phoneme level integration 
that can support sophisticated phonologi- 
cal/morphological processing in the integra- 
tion of speech and language, which is es- 
sential for the morphologically complex ag- 
glutinative languages. Also, the TDNN- 
CYK integration is a first attempt to de- 
velop a morphologically general integration 
model using the connectionist speech recog- 
nition paradigm. 

Our TDNN-CYK spoken language architecture has several novel features for speech and natural language processing. First, the diphone-based TDNN provides a suitable sub-word unit of recognition that reflects the Korean phonetic characteristics well. Second, the morphological analysis combined with the declarative phonological rule modeling is well suited to mapping phonetic transcriptions onto orthographic morphemes, which is an essential task for every spoken language processing model. Finally, the TRIE-structured phonetic transcription indexing serves to reduce the phoneme access complexity in the direct morphological analysis from the phoneme lattice.

The experiments show that the final Eojeol recognition rate is over 80% in middle-vocabulary, speaker-dependent continuous Eojeol recognition, which is very promising considering the continuous speech and the combination of several processing steps (diphone spotting, phoneme lattice decoding, and morphological analysis). However, the performance is relatively low compared with the continuous diphone recognition (which is over 93% under the same conditions) because of the large number of insertion errors in long-duration speech (an Eonjeol or phrase). To recover from the insertion errors, we plan to incorporate an error-correcting scheme into our phoneme decoding process that will yield an error-free phoneme lattice from which the morphological analyzer can produce perfect analysis results.